perf(player): p0-1b perf tests for fps, scrub latency, and media sync drift #400
Conversation
jrusso1020
left a comment
Scenario design is careful and the methodology notes in each file are a good sign — the "alternate forward/backward seek targets so the rAF watcher doesn't match a stale `getTime()` value" trick in `04-scrub.ts` is exactly the kind of detail that makes or breaks a microbenchmark. Same for the "install the rAF watcher before `play()` and `pause()` in the same tick to freeze at the captured time" pattern that I assume `05-drift.ts` uses.
The `scrub_latency_p95_inline` vs `scrub_latency_p95_isolated` split directly pins the value of #397's sync path as a measurable metric — monkey-patching `_trySyncSeek` to `() => false` to force the postMessage path in the same page load is a clean way to separate the modes without needing two separate runs.
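The monkey-patch trick can be pictured with a stub — the player's real `seek()`/`_trySyncSeek` wiring surely differs, this just shows the control flow being forced onto the fallback path:

```typescript
// Stub player modeling the assumed inline-vs-postMessage seek split.
type Player = {
  _trySyncSeek: (t: number) => boolean;
  _sendControl: (cmd: string, t: number) => void;
  seek: (t: number) => "inline" | "isolated";
};

const sent: number[] = []; // seeks that went over the postMessage bridge

const player: Player = {
  _trySyncSeek: () => true, // same-origin fast path succeeds by default
  _sendControl: (_cmd, t) => void sent.push(t),
  seek(t) {
    if (this._trySyncSeek(t)) return "inline";
    this._sendControl("seek", t); // cross-origin-style fallback
    return "isolated";
  },
};

console.log(player.seek(1.0)); // "inline"
player._trySyncSeek = () => false; // force the isolated path, same page load
console.log(player.seek(7.0)); // "isolated"
```

Both modes run in one page load because only the patched method's return value changes; nothing else about the player's state is disturbed.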
Three non-blocking observations:

- With `runs: 3` and 10 seeks per mode per run, that's 30 samples per mode per shard for p95. That's on the edge of "stable enough" — a single outlier at index 28/29 can swing the p95 by tens of ms. Not a blocker (you're in measure mode), but if you see p95 flapping in the first few enforcement cycles, bumping to `runs: 5` is the cheapest fix.
- `MATCH_TOLERANCE_S = 0.05` is generous. On tight-latency scrubs, a 50ms tolerance window between seek command and confirmation paint could mask a legitimate regression where the measured latency is ~30ms but the tolerance swallows the last rAF. Worth revisiting once real baselines land.
- The drift scenario (which I haven't read line-by-line) is the one most likely to produce flaky signal, since it's inherently long-running. Keep an eye on its coefficient of variation over the first week — if it's >20%, that's the signal to tighten the `driftMaxMs`/`driftP95Ms` baselines and investigate whether there's a non-deterministic timing source in the runtime.
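To make the sample-size point concrete: with a nearest-rank p95 over 30 pooled samples, the reported value is essentially the second-largest sample, so a couple of noisy seeks move it wholesale. A sketch (assuming a nearest-rank implementation — the harness's actual percentile method may differ):

```typescript
// Nearest-rank percentile: p95 of n=30 picks the 29th-ranked sample.
function percentile(samples: number[], p: number): number {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length); // nearest-rank method
  return sorted[Math.min(rank, sorted.length) - 1];
}

// 30 well-behaved latencies between 6 and 8 ms...
const stable = Array.from({ length: 30 }, (_, i) => 6 + (i % 5) * 0.5);
// ...and the same pool with two 60 ms outliers swapped in at the top.
const flappy = [...stable.slice(0, 28), 60, 60];

console.log(percentile(stable, 95)); // 8
console.log(percentile(flappy, 95)); // 60 — the outliers become the p95
```

Going to `runs: 5` (50 samples per mode) pushes the p95 rank down into the bulk of the distribution, which is why it's the cheapest stabilizer.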
Approved.
— Rames Jusso
force-pushed acdf9af → 111e128
force-pushed 725bc89 → 0af9ce7
miguel-heygen
left a comment
I pulled this stack into a local worktree, ran the perf harness end-to-end, and compared the top branch against main by swapping the built player/runtime artifacts under the same harness. The scrub/drift gains are real but modest: inline scrub p95 improved from 7.2ms on main to 7.0ms here, isolated scrub p95 improved from 8.2ms to 7.1ms, drift max improved from 25.67ms to 24.67ms, and drift p95 improved from 25.33ms to 24.0ms.
The blocking issue is the FPS scenario. `packages/player/tests/perf/scenarios/02-fps.ts` is measuring raw `requestAnimationFrame` cadence and then comparing it against a 60fps target. On my machine both main and this branch reported ~120fps for the same fixture, which means the metric is saturating to browser/display cadence rather than proving "player sustained 60fps playback." With the current implementation, a high-refresh runner can make `fpsMin: 55` in `baseline.json` look comfortably green without actually telling us whether playback stayed near the intended 60fps budget.
I’d like to see this normalized to a refresh-rate-independent signal before we merge the scenario as a regression gate. Concretely: either derive the metric from missed target intervals against the 60fps composition clock, or capture an effective render cadence that is explicitly bounded to the fixture/runtime target instead of host rAF speed.
Once that part is fixed, I’m comfortable with the rest of the scenario design. The alternating seek targets, iframe-side timing, and drift sampling approach all looked sound in local runs.
force-pushed 111e128 → 306c164
force-pushed 0af9ce7 → 433b609
Following up on @miguel-heygen's stress-test finding — agreed that the current FPS scenario saturates at the runner's display refresh. On a 120Hz runner, raw rAF cadence reads ~120fps regardless of whether playback is actually keeping up. A few approaches that remove the refresh-rate dependency:

1. **Composition-time-advanced-per-wall-second** (my first choice). In the iframe, sample `__player.getTime()` at a fixed wall-clock cadence and emit composition seconds advanced per wall second. Bonus: this is the metric that actually answers "did the composition play at its intended speed," which is the user-observable thing. Display refresh only matters if it's lower than the composition fps — at 60Hz with a 60fps composition the metric would still read ~1.0; at 30Hz displaying a 60fps composition it'd read ~0.5 and legitimately flag the bad experience.
2. **Missed-deadline rate.** In the iframe's rAF loop, count ticks where the delta since the previous tick exceeded the target frame interval.
3. **PerformanceObserver + frame timing.**

Option 1 is the simplest and most directly answers "is the player sustaining playback" — the metric has a physical interpretation rather than being a threshold crossing. Happy to re-review once this lands. Baseline-wise: whichever metric lands, `fpsMin: 55` will need a matching new key and threshold in `baseline.json`.

Everything else in the scenario design — alternating seek targets, in-tick pause, drift sampling — held up under Miguel's local run and my static read, so I think just the fps metric needs rework. The rest of my non-blocking notes (samples/shard, tolerance windows) stand but are secondary.

— Rames Jusso
@jrusso1020 @miguel-heygen — thanks for the careful read. Blocker addressed, plus the three non-blocking notes:

**Blocking — FPS metric saturating at display refresh (miguel-heygen):** addressed in `02-fps.ts`. The scenario no longer measures raw `requestAnimationFrame` cadence — it samples `__player.getTime()` at a fixed wall-clock interval and emits composition time advanced per wall second, with `baseline.json` updated to the new key and threshold.

The non-blocking observations:

- Samples per shard: documented the `DEFAULT_RUNS` rationale in `index.ts`, with a TODO to revisit after CI baseline data lands.
- Tolerance window: same treatment — documented in `04-scrub.ts`, with a TODO to tighten once baselines land.
- Drift variability: added the CV log line in `05-drift.ts` as an ungated monitoring signal.

Nothing else outstanding.
miguel-heygen
left a comment
Re-checked the live head and validated the requested-change blocker is fixed. The FPS scenario now measures composition-time advancement rather than raw rAF cadence, and the local player perf run passed on the reviewed head.
force-pushed 433b609 → dbe8090
force-pushed 2256f55 → 3c347f7
jrusso1020
left a comment
Re-reviewed the incremental changes at head (3c347f7c) since my prior approval on commit acdf9af3. The FPS scenario has been redesigned to address Miguel's blocker, and my earlier non-blocking notes got explicit treatment — clean re-approval.
**FPS metric replacement (addressing the 120Hz-saturates blocker):**

`02-fps.ts` no longer counts raw `requestAnimationFrame` ticks. New design:

- Samples `__player.getTime()` at a fixed wall-clock cadence (`setInterval(100ms)`) for a 5s window.
- Emits `composition_time_advancement_ratio_min = (getTime(end) - getTime(start)) / wallSeconds`.
- Reads ~1.0 when the player keeps up with its intended speed, falls below 1.0 when it stalls. Refresh-rate independent by construction — both numerator and denominator are wall-clock-derived, neither is a frame count, so a 60 / 120 / 240Hz runner converges to the same value.
- `baseline.json` drops `fpsMin: 55`, adds `compositionTimeAdvancementRatioMin: 0.95`. The `PerfBaseline` JSDoc clearly documents the semantic shift.

The "reset sample buffer after `isPlaying() === true`" gotcha is preserved in the new implementation — samples captured during the postMessage play ramp-up would compare a stale comp=0 against an advancing wall clock and bias the ratio toward 0. Good.

Non-FPS scenarios unchanged from prior approval, and my non-blocking notes got explicit TODOs:

- `04-scrub.ts`: `MATCH_TOLERANCE_S = 0.05` now has an expanded JSDoc explaining the three sources of slack (frame quantization on postMessage, sub-frame intra-clip advance, runner jitter) plus a TODO to tighten to 16ms once baselines land and optionally split per-mode.
- `05-drift.ts` now computes + logs CV (stddev/mean) alongside max and p95, with a TODO to decide whether to publish CV as a tracked-but-ungated baseline after 2 weeks of CI data. That's exactly the early-warning signal I asked for — if CV climbs while max/p95 stay green, the 50ms `setInterval` assumption has shifted and the baselines are about to flake.
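The sampler-plus-ratio design reduces to very little code. A sketch with hypothetical `Sample` shapes (not the scenario's actual types) — only the first and last samples matter for the ratio itself:

```typescript
// One (wall time, composition time) observation from the iframe.
interface Sample {
  wallMs: number; // performance-clock wall time of the sample
  compS: number;  // __player.getTime() at that moment
}

// Composition seconds advanced per wall-clock second over the window.
function advancementRatio(samples: Sample[]): number {
  const first = samples[0];
  const last = samples[samples.length - 1];
  const wallSeconds = (last.wallMs - first.wallMs) / 1000;
  return (last.compS - first.compS) / wallSeconds;
}

// Healthy 5s window sampled every 100ms: composition time tracks wall time.
const healthy = Array.from({ length: 51 }, (_, i) => ({ wallMs: i * 100, compS: i * 0.1 }));
// Stalled player: composition time advances at half speed.
const stalled = Array.from({ length: 51 }, (_, i) => ({ wallMs: i * 100, compS: i * 0.05 }));

console.log(advancementRatio(healthy)); // 1 — regardless of display refresh
console.log(advancementRatio(stalled)); // 0.5 — flags the bad experience
```

Because neither input is a frame count, the value is identical on a 60Hz or 240Hz runner; only the player's own clock can move it.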
Approving the incremental work.
— Review by pr-review
force-pushed dbe8090 → 4f09c94
force-pushed 3c347f7 → 1f13f80
force-pushed 4f09c94 → a08c782
…tio metric

Addresses blocking PR feedback on #400 from miguel-heygen and jrusso1020: the previous FPS metric measured raw rAF cadence and was refresh-rate dependent (a 30fps composition would always 'pass' on a 60Hz display, because the metric reported rAF cadence, not composition progress).

- `02-fps.ts`: re-implemented to sample `__player.getTime()` at 100ms wall intervals and emit `(deltaCompTime / wallSeconds)`. New metric is refresh-rate independent and measures what we actually care about: whether the player keeps up with composition-time playback.
- `perf-gate.ts`: replaced `fpsMin` / `droppedFramesMax` with `compositionTimeAdvancementRatioMin` (higher-is-better, target 0.95).
- `baseline.json`: updated to match the new metric key + threshold.
- `04-scrub.ts`: documented `MATCH_TOLERANCE_S` rationale (frame quantization on postMessage, sub-frame intra-clip advance, runner jitter) + TODO to tighten once we have CI baseline data.
- `05-drift.ts`: log coefficient of variation as a soft monitoring signal (not gated) + TODO to decide whether to publish it as a tracked metric.
- `index.ts`: documented `DEFAULT_RUNS` rationale (load=5 because p95 over n=3 is just max; fps/scrub/drift=3 because they pool samples across runs) + TODO to revisit fps=3 after collecting CI baseline data.
- `index.ts`: removed dead reference to docs/internal/player-perf-baselines.md.
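The coefficient-of-variation log mentioned for `05-drift.ts` is a one-liner worth pinning down. A sketch (assuming population stddev — the scenario may use sample stddev, which differs slightly at small n):

```typescript
// CV = stddev / mean: a scale-free spread measure for drift samples (ms).
function coefficientOfVariation(xs: number[]): number {
  const mean = xs.reduce((a, b) => a + b, 0) / xs.length;
  const variance = xs.reduce((a, x) => a + (x - mean) ** 2, 0) / xs.length;
  return Math.sqrt(variance) / mean;
}

// Tight drift samples → low CV; the same mean with a wild spread → high CV,
// an early warning even while max/p95 stay under their ceilings.
console.log(coefficientOfVariation([24, 25, 26, 25, 24, 26])); // ≈ 0.033
console.log(coefficientOfVariation([5, 45, 10, 40, 15, 35]));  // ≈ 0.62
```

Both inputs above have a mean of 25 ms, which is why CV (not max or p95 alone) is the signal that catches the second distribution.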
force-pushed 621a276 → 84efd8a
force-pushed 2dc4b8f → 407813c
force-pushed 84efd8a → 5a93e44
force-pushed 407813c → ab395e9
force-pushed 5a93e44 → 871c986
force-pushed ab395e9 → 70e61ca
force-pushed 871c986 → 224503d
force-pushed 70e61ca → 658e30c
force-pushed 224503d → f637194
force-pushed db957ee → cc6039f
force-pushed f637194 → 23e3dcb
## Summary

First slice of `P0-1` from the player perf proposal: lays the foundation for a player perf gate so later PRs can plug in fps / scrub / drift / parity scenarios without rebuilding infrastructure. Ships one smoke scenario (`03-load`, cold + warm composition load) to prove the gate end-to-end on real numbers.

## Why

There was no automated way to catch player perf regressions. Every perf concern in the existing proposal — composition load time, sustained FPS, scrub p95, mirror-clock drift, live-vs-seek parity — needs the same plumbing: a same-origin harness, a Puppeteer runner, a baseline file, a gate that emits structured results, and a CI workflow that runs the right scenarios on the right changes. Building that up-front in one reviewable PR lets every subsequent perf PR (`P0-1b`, `P0-1c`, and beyond) be a 100-line scenario file plus a baseline entry instead of re-litigating the framework.

## What changed

### Harness — `packages/player/tests/perf/server.ts`

- `Bun.serve` on a free port, single same-origin host for the player IIFE bundle, hyperframe runtime, GSAP from `node_modules`, and fixture HTML.
- Same-origin matters: cross-origin would force every probe through `postMessage`, hiding bugs and inflating numbers in ways production never sees. Tests should measure the path the studio editor actually takes.
- Routes:
  - `/player.js` → built IIFE bundle (rebuilt on demand).
  - `/vendor/runtime.js`, `/vendor/gsap.min.js` → resolved from `node_modules` so fixtures don't need to ship copies.
  - `/fixtures/*` → fixture HTML.

### Runner — `packages/player/tests/perf/runner.ts`

- `puppeteer-core` thin wrappers (`launchBrowser`, `loadHostPage`).
- Uses the system Chrome detected by `setup-chrome` in CI rather than the bundled puppeteer revision — keeps the action smaller, lets us pin Chrome version policy at the workflow level, and matches what users actually run.

### Gate — `packages/player/tests/perf/perf-gate.ts` + `baseline.json`

- Loads `baseline.json` (initial budgets: cold/warm comp load, fps, scrub p95 isolated/inline, drift max/p95) with a 10% `allowedRegressionRatio`.
- Per-metric direction (`lower-is-better` / `higher-is-better`) so the same evaluator handles latency and throughput.
- Returns a structured `GateReport` consumed by both the CLI (table output) and `metrics.json` (CI artifact).
- Two modes: `measure` (log only — used during the rollout) and `enforce` (fail the build) — flip per-metric once we trust the signal, without touching the harness.

### CLI orchestrator — `packages/player/tests/perf/index.ts`

- Parses `--mode` / `--scenarios` / `--runs` / `--fixture` in both space- and equals-separated form (so `--scenarios fps,scrub` and `--scenarios=fps,scrub` both work — matches what humans type and what GitHub Actions emits).
- Runs scenarios, runs the gate, and **always** writes `results/metrics.json` with schema version, git SHA, metrics, and gate rows — so failed runs are still investigable from the artifact alone.

### Fixture + smoke scenario

- `fixtures/gsap-heavy/index.html`: 200 stagger-animated tiles, no media. Heavy enough to make load time meaningful, light enough to be deterministic.
- `scenarios/03-load.ts`: cold + warm composition load. Measures from navigation start to player `ready` event, reports p95 across runs.

### CI — `.github/workflows/player-perf.yml`

- `paths-filter` on `player` / `core` / `runtime` — perf only runs when something that could move the needle actually changed.
- Sets up bun + node + chrome, runs perf in `measure` mode on a shard matrix (so future scenarios shard naturally), uploads `metrics.json` artifacts, and a summary job aggregates shard results into a single PR comment.

### Wiring

- `packages/player`: `puppeteer-core`, `gsap`, `@types/bun` devDeps; typecheck extended to cover the perf `tsconfig`; new `perf` script.
- Root `package.json`: `player:perf` workspace script so `bun run player:perf` runs the whole suite locally with the same flags CI uses.
- `.gitignore`: `packages/player/tests/perf/results/`.
- Separate `tests/perf/tsconfig.json` so test code doesn't pollute the package `rootDir` while still being typechecked.

## Test plan

- [x] Local: `bun run player:perf` passes — cold p95 ≈ 386 ms, warm p95 ≈ 375 ms, both well under the seeded baselines.
- [x] Typecheck, lint, format pass on the perf workspace.
- [x] Existing player unit tests (71/71) still green.
- [ ] First CI run after merge will be the real signal: confirms `setup-chrome` works on hosted runners, the shard matrix wires up, and `metrics.json` artifacts upload.

## Stack

Step `P0-1a` of the player perf proposal. The next two slices are content-only — they don't touch the harness:

- `P0-1b` (#400): adds `02-fps`, `04-scrub`, `05-drift` scenarios on a 10-video-grid fixture.
- `P0-1c` (#401): adds `06-parity` (live playback vs. synchronously-seeked reference, compared via SSIM).

Wiring this gate up first means each follow-up is a self-contained scenario file + baseline row + workflow shard.
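The per-metric-direction gate described above fits in one small evaluator. A sketch with hypothetical shapes — the real `perf-gate.ts` types and `GateReport` rows may differ:

```typescript
// One evaluator handles both latency (lower-is-better) and throughput
// (higher-is-better) budgets, with a 10% allowed regression past baseline.
type Direction = "lower-is-better" | "higher-is-better";

interface GateRow {
  name: string;
  value: number;
  baseline: number;
  direction: Direction;
  pass: boolean;
}

function evaluate(
  name: string,
  value: number,
  baseline: number,
  direction: Direction,
  allowedRegressionRatio = 0.1,
): GateRow {
  const limit =
    direction === "lower-is-better"
      ? baseline * (1 + allowedRegressionRatio) // may exceed baseline by 10%
      : baseline * (1 - allowedRegressionRatio); // may undershoot by 10%
  const pass = direction === "lower-is-better" ? value <= limit : value >= limit;
  return { name, value, baseline, direction, pass };
}

console.log(evaluate("scrub_latency_p95_inline_ms", 34, 33, "lower-is-better").pass); // true (within 10%)
console.log(evaluate("parity_ssim_min", 0.83, 0.95, "higher-is-better").pass); // false
```

In `measure` mode the same rows would be logged without failing the build; `enforce` mode turns `pass === false` into a non-zero exit.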
force-pushed cc6039f → 10d2725
force-pushed 23e3dcb → 33150d2
force-pushed 33150d2 → b9fd169
Merge activity
## Summary

Adds **scenario 06: live-playback parity** — the third and final tranche of the P0-1 perf-test buildout (`p0-1a` infra → `p0-1b` fps/scrub/drift → this). The scenario plays the `gsap-heavy` fixture, freezes it mid-animation, screenshots the live frame, then synchronously seeks the same player back to that exact timestamp and screenshots the reference. The two PNGs are diffed with `ffmpeg -lavfi ssim` and the resulting average SSIM is emitted as `parity_ssim_min`. Baseline gate: **SSIM ≥ 0.95**.

This pins the player's two frame-production paths (the runtime's animation loop vs. `_trySyncSeek`) to each other visually, so any future drift between scrub and playback fails CI instead of silently shipping.

## Motivation

`<hyperframes-player>` produces frames two different ways:

1. **Live playback** — the runtime's animation loop advances the GSAP timeline frame-by-frame.
2. **Synchronous seek** (`_trySyncSeek`, landed in #397) — for same-origin embeds, the player calls into the iframe runtime's `seek()` directly and asks for a specific time.

These paths must agree. If they don't — different rounding, different sub-frame sampling, different state ordering — scrubbing a paused composition shows different pixels than a paused-during-playback frame at the same time. That's a class of bug that only surfaces visually, never in unit tests, and only at specific timestamps where many things are mid-flight.

`gsap-heavy` is a 10s composition with 60 tiles each running a staggered 4s out-and-back tween. At t=5.0s a large fraction of those tiles are mid-flight, so the rendered frame has many distinct, position-sensitive pixels — the worst-case input for any sub-frame disagreement. If the two paths produce identical pixels here, they'll produce identical pixels everywhere that matters.

## What changed

- **`packages/player/tests/perf/scenarios/06-parity.ts`** — new scenario (~340 lines). Owns capture, seek, screenshot, SSIM, artifact persistence, and aggregation.
- **`packages/player/tests/perf/index.ts`** — register `parity` as a scenario id, default-runs = 3, dispatch to `runParity`, include in the default scenario list.
- **`packages/player/tests/perf/perf-gate.ts`** — extend `PerfBaseline` with `paritySsimMin`.
- **`packages/player/tests/perf/baseline.json`** — `paritySsimMin: 0.95`.
- **`.github/workflows/player-perf.yml`** — add a `parity` shard (3 runs) to the matrix alongside `load` / `fps` / `scrub` / `drift`.

## How the scenario works

The hard part is making the two captures land on the *exact same timestamp* without trusting `postMessage` round-trips or arbitrary `setTimeout` settling.

1. **Install an iframe-side rAF watcher** before issuing `play()`. The watcher polls `__player.getTime()` every animation frame and, the first time `getTime() >= 5.0`, calls `__player.pause()` *from inside the same rAF tick*. `pause()` is synchronous (it calls `timeline.pause()`), so the timeline freezes at exactly that `getTime()` value with no postMessage round-trip. The watcher's Promise resolves with that frozen value as the canonical `T_actual` for the run.
2. **Confirm `isPlaying() === true`** via `frame.waitForFunction` before awaiting the watcher. Without this, the test can hang if `play()` hasn't kicked the timeline yet.
3. **Wait for paint** — two `requestAnimationFrame` ticks on the host page. The first flushes pending style/layout, the second guarantees a painted compositor commit. Same paint-settlement pattern as `packages/producer/src/parity-harness.ts`.
4. **Screenshot the live frame** — `page.screenshot({ type: "png" })`.
5. **Synchronously seek to `T_actual`** — call `el.seek(capturedTime)` on the host page. The player's public `seek()` calls `_trySyncSeek` which (same-origin) calls `__player.seek()` synchronously, so no postMessage await is needed. The runtime's deterministic `seek()` rebuilds frame state at exactly the requested time.
6. **Wait for paint** again, screenshot the reference frame.
7. **Diff with ffmpeg** — `ffmpeg -hide_banner -i reference.png -i actual.png -lavfi ssim -f null -`. ffmpeg writes per-channel + overall SSIM to stderr; we parse the `All:` value, clamp at 1.0 (ffmpeg occasionally reports 1.000001 on identical inputs), and treat it as the run's score.
8. **Persist artifacts** under `tests/perf/results/parity/run-N/` (`actual.png`, `reference.png`, `captured-time.txt`) so CI can upload them and so a failed run is locally reproducible. Directory is already gitignored via the existing `packages/player/tests/perf/results/` rule.

### Aggregation

`min()` across runs, **not** mean. We want the *worst observed* parity to pass the gate so a single bad run can't get masked by averaging. Both per-run scores and the aggregate are logged.

### Output metric

| name | direction | baseline |
|-------------------|------------------|-----------------------|
| `parity_ssim_min` | higher-is-better | `paritySsimMin: 0.95` |

With deterministic rendering enabled in the runner, identical pixels produce SSIM very close to 1.0; the 0.95 threshold leaves headroom for legitimate fixture-level noise (font hinting, GPU compositor variance) while still catching any real disagreement between the two paths.

## Test plan

- `bun run player:perf -- --scenarios=parity --runs=3` locally on `gsap-heavy` — passes with SSIM ≈ 0.999 across all 3 runs.
- Inspected `results/parity/run-1/actual.png` and `reference.png` side-by-side — visually identical.
- Inspected `captured-time.txt` to confirm `T_actual` lands just past 5.0s (within one frame).
- Sanity test: temporarily forced a 1-frame offset between live and reference capture; SSIM dropped well below 0.95 as expected, confirming the threshold catches real drift.
- CI: `parity` shard added alongside the existing `load` / `fps` / `scrub` / `drift` shards; same `measure`-mode / artifact-upload / aggregation flow.
- `bunx oxlint` and `bunx oxfmt --check` clean on the new scenario.

## Stack

This is the top of the perf stack:

1. #393 `perf/x-1-emit-performance-metric` — performance.measure() emission
2. #394 `perf/p1-1-share-player-styles-via-adopted-stylesheets` — adopted stylesheets
3. #395 `perf/p1-2-scope-media-mutation-observer` — scoped MutationObserver
4. #396 `perf/p1-4-coalesce-mirror-parent-media-time` — coalesce currentTime writes
5. #397 `perf/p3-1-sync-seek-same-origin` — synchronous seek path (the path this PR pins)
6. #398 `perf/p3-2-srcdoc-composition-switching` — srcdoc switching
7. #399 `perf/p0-1a-perf-test-infra` — server, runner, perf-gate, CI
8. #400 `perf/p0-1b-perf-tests-for-fps-scrub-drift` — fps / scrub / drift scenarios
9. **#401 `perf/p0-1c-live-playback-parity-test` ← you are here**

With this PR landed the perf harness covers all five proposal scenarios: `load`, `fps`, `scrub`, `drift`, `parity`.
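The `ffmpeg -lavfi ssim` parse-and-clamp step described in the parity scenario can be sketched like this. The sample stderr line below is illustrative of ffmpeg's `Parsed_ssim` format, not captured from a real run:

```typescript
// Extract the overall SSIM ("All:" field) from ffmpeg's ssim-filter stderr
// and clamp at 1.0 (ffmpeg occasionally reports 1.000001 on identical inputs).
function parseSsim(stderr: string): number {
  const match = stderr.match(/All:\s*([\d.]+)/);
  if (!match) throw new Error("no SSIM 'All:' value in ffmpeg output");
  return Math.min(1, parseFloat(match[1]));
}

const sampleStderr =
  "[Parsed_ssim_0 @ 0xabc] SSIM Y:0.998 U:0.999 V:0.999 All:0.998512 (28.27)";

console.log(parseSsim(sampleStderr)); // 0.998512
console.log(parseSsim("All:1.000001 (inf)")); // 1 — clamped
```

Keeping the clamp inside the parser means the aggregated `min()` across runs can never report a nonsensical score above 1.0.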

Summary
Second slice of
P0-1from the player perf proposal: plugs the three steady-state scenarios — sustained playback FPS, scrub latency, and media-sync drift — into the perf gate that landed in #399. Adds the multi-video fixture they all share, wires three new shards into CI, and seeds one new baseline (droppedFramesMax).Why
#399 stood up the harness and proved it with a single load-time scenario. By itself that's enough to catch regressions in initial composition setup, but it can't catch the things players actually fail at in production:
Each of these is a target metric in the proposal with a concrete budget. This PR turns those budgets into gated CI signals and produces continuous data for them on every player/core/runtime change.
What changed
Fixture —
packages/player/tests/perf/fixtures/10-video-grid/index.html: 10-second composition, 1920×1080, 30 fps, with 10 simultaneously-decoding video tiles in a 5×2 grid plus a subtle GSAP scale "breath" on each tile (so the rAF/RVFC loops have real work to do without GSAP dominating the budget the decoder needs).sample.mp4: small (~190 KB) clip checked in so the fixture is hermetic — no external CDN dependency, identical bytes on every run.data-composition-id="main"host pattern asgsap-heavy, so the existing harness loader works without changes.02-fps.ts— sustained playback frame rate10-video-grid, callsplayer.play(), samplesrequestAnimationFramecallbacks inside the iframe for 5 s.play(), wait for__player.isPlaying() === true, then reset the sample buffer — otherwise the postMessage round-trip ramp-up window drags the average down by 5–10 fps.(samples − 1) / (lastTs − firstTs in s); uses rAF timestamps (the same ones the compositor saw) rather than wall-clocksetTimeout, so we're measuring real frame production.min(fps)andmax(droppedFrames)— worst case wins, since the proposal asserts a floor on fps and a ceiling on drops.playback_fps_min(higher-is-better, baselinefpsMin = 55) andplayback_dropped_frames_max(lower-is-better, baselinedroppedFramesMax = 3).04-scrub.ts— scrub latency, inline + isolated10-video-grid, pauses, then issues 10 seek calls in two batches: first the synchronous inline path (<hyperframes-player>'s default same-origin_trySyncSeek), then the isolated path (forced by replacing_trySyncSeekwith() => false, which makes the player fall back to the postMessage_sendControl("seek")bridge that cross-origin embeds and pre-feat(player): synchronous seek() API with same-origin detection #397 builds use).__player.getTime()until it's withinMATCH_TOLERANCE_S = 0.05 sof the requested target. 
- Tolerance exists because the postMessage bridge converts seconds → frame number → seconds, and that round-trip can introduce sub-frame quantization drift even for targets on the canonical fps grid.
- Timestamps are `performance.timeOrigin + performance.now()` in both contexts. `timeOrigin` is consistent across same-process frames, so `t1 − t0` is a true wall-clock latency, not a host-only or iframe-only stopwatch.
- Seek targets alternate forward/backward (`1.0, 7.0, 2.0, 8.0, 3.0, 9.0, 4.0, 6.0, 5.0, 0.5`) so no two consecutive seeks land near each other — this protects the rAF watcher from matching against a stale `getTime()` value before the seek command is processed.
- Reports `percentile(95)` across the pooled per-seek latencies from every run. With 10 seeks × 2 modes × 3 runs we get 30 samples per mode per CI shard, enough for a stable p95.
- Metrics: `scrub_latency_p95_inline_ms` (lower-is-better, baseline `scrubLatencyP95InlineMs = 33`) and `scrub_latency_p95_isolated_ms` (lower-is-better, baseline `scrubLatencyP95IsolatedMs = 80`).

### `05-drift.ts` — media sync drift

- Loads `10-video-grid`, plays 6 s, and instruments every `video[data-start]` element with `requestVideoFrameCallback`. Each callback records `(compositionTime, actualMediaTime)` plus a snapshot of the clip transform (`clipStart`, `clipMediaStart`, `clipPlaybackRate`).
- Per-frame drift is `|actualMediaTime − ((compTime − clipStart) × clipPlaybackRate + clipMediaStart)|` — the same transform the runtime applies in `packages/core/src/runtime/media.ts`, snapshotted once at sampler install so the per-frame work is just subtract + multiply + abs.
- Same steady-state gate as `02-fps.ts`: frames captured during the postMessage round-trip would compare a non-zero `mediaTime` against `getTime() === 0` and inflate drift by hundreds of ms.
- Reports `max()` and `percentile(95)` across the pooled per-frame drifts.
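That per-frame transform is small enough to sketch directly. Names here are illustrative; the real sampler lives in `05-drift.ts`:

```typescript
// Illustrative sketch of the per-frame drift math; not the 05-drift.ts code.
interface ClipTransform {
  clipStart: number;        // composition time at which the clip begins (s)
  clipMediaStart: number;   // media time the clip starts from (s)
  clipPlaybackRate: number; // media seconds per composition second
}

// The transform is snapshotted once at sampler install, so per-frame work is
// just subtract + multiply + abs.
function driftMs(
  compTimeS: number,
  actualMediaTimeS: number,
  t: ClipTransform,
): number {
  const expected =
    (compTimeS - t.clipStart) * t.clipPlaybackRate + t.clipMediaStart;
  return Math.abs(actualMediaTimeS - expected) * 1000;
}

// A clip that enters the composition at 2 s and plays its media from 0.5 s at 1x:
const clip: ClipTransform = { clipStart: 2, clipMediaStart: 0.5, clipPlaybackRate: 1 };

// At comp time 3 s the expected media time is 1.5 s; an element reporting
// 1.52 s is out of sync by 20 ms.
console.log(driftMs(3, 1.52, clip)); // → 20 (within floating-point noise)
```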
- The proposal's max-drift ceiling of 500 ms is intentional — the runtime hard-resyncs when `|currentTime − relTime| > 0.5 s`, so a regression past 500 ms means the corrective resync kicked in and the viewer saw a jump.
- Metrics: `media_drift_max_ms` (lower-is-better, baseline `driftMaxMs = 500`) and `media_drift_p95_ms` (lower-is-better, baseline `driftP95Ms = 100`).

## Wiring
- `packages/player/tests/perf/index.ts`: add `fps`, `scrub`, `drift` to `ScenarioId`, `DEFAULT_RUNS`, the default scenario list (`--scenarios` defaults to all four), and three new dispatch branches.
- `packages/player/tests/perf/perf-gate.ts`: add `droppedFramesMax: number` to `PerfBaseline`. Other baseline keys for these scenarios were already seeded in #399 (perf(player): p0-1a perf test infra + composition-load smoke test).
- `packages/player/tests/perf/baseline.json`: add `droppedFramesMax: 3`.
- `.github/workflows/player-perf.yml`: three new matrix shards (fps/scrub/drift) at `runs: 3`. Same `paths-filter` and same artifact-upload pattern as the `load` shard, so the summary job aggregates them automatically.

## Methodology highlights
These three patterns recur in all three scenarios and are worth noting because they're load-bearing for the numbers we report:
- Wait for steady state before sampling. The `play()` API is async (postMessage), so any samples captured before `__player.isPlaying() === true` belong to ramp-up, not steady-state. Both `02-fps` and `05-drift` clear `__perfRafSamples`/`__perfDriftSamples` after the wait. Without this, fps drops 5–10 and drift inflates by hundreds of ms.
- Measure with in-page clocks (`performance.timeOrigin + performance.now()` for scrub, rAF/RVFC timestamps for fps/drift) rather than host-side. The iframe is what the user sees; host-side timing would conflate Puppeteer's IPC overhead with real player latency.
- Install the watcher before `pause()` is issued, so the pause command's postMessage round-trip can't perturb the tail of the measurement window.

## Test plan
- `bun run player:perf` runs all four scenarios end-to-end on the 10-video-grid fixture.
- Metric names line up with the `baselineKeys` so `perf-gate.ts` can find them.
- CI shards upload `metrics.json` artifacts.

## Stack
Step `P0-1b` of the player perf proposal.

Builds on:

- `P0-1a` (#399, perf(player): p0-1a perf test infra + composition-load smoke test): the harness, runner, gate, and CI workflow this PR plugs new scenarios into.

Followed by:

- `P0-1c` (#401, perf(player): p0-1c live-playback parity test via SSIM): `06-parity` — live playback frame vs. synchronously-seeked reference frame, compared via SSIM, on the existing `gsap-heavy` fixture from #399.